Class 13 | March 8, 2024

load data

library(Stat2Data)
data("RailsTrails")
colnames(RailsTrails)
##  [1] "HouseNum"     "Acre"         "AcreGroup"    "Adj1998"      "Adj2007"     
##  [6] "Adj2011"      "BedGroup"     "Bedrooms"     "BikeScore"    "Diff2014"    
## [11] "Distance"     "DistGroup"    "GarageSpaces" "GarageGroup"  "Latitude"    
## [16] "Longitude"    "NumFullBaths" "NumHalfBaths" "NumRooms"     "PctChange"   
## [21] "Price1998"    "Price2007"    "Price2011"    "Price2014"    "SFGroup"     
## [26] "SquareFeet"   "StreetName"   "StreetNum"    "WalkScore"    "Zip"

fit the two models

model 1: A model that predicts the 2014 sale price based on an additive combination of the home’s distance from the bike path and its size in square feet.

model1 = lm(Price2014 ~ Distance + SquareFeet, data = RailsTrails)
summary(model1)
## 
## Call:
## lm(formula = Price2014 ~ Distance + SquareFeet, data = RailsTrails)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -152.15  -30.27   -4.14   25.75  337.93 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   78.985     25.607   3.085  0.00263 ** 
## Distance     -15.788      7.586  -2.081  0.03994 *  
## SquareFeet   147.920     12.765  11.588  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65.55 on 101 degrees of freedom
## Multiple R-squared:  0.6574, Adjusted R-squared:  0.6506 
## F-statistic: 96.89 on 2 and 101 DF,  p-value: < 2.2e-16

model 2: A model that predicts the 2014 sale price based on the home’s distance from the bike path, its size in square feet, and the number of full bathrooms the home has, allowing the effect of distance to depend on a home’s size

model2 = lm(Price2014 ~ NumFullBaths + Distance * SquareFeet, data = RailsTrails)
summary(model2)
## 
## Call:
## lm(formula = Price2014 ~ NumFullBaths + Distance * SquareFeet, 
##     data = RailsTrails)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -163.563  -32.212   -0.187   21.010  306.570 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            31.58      32.31   0.978   0.3307    
## NumFullBaths           27.47      12.24   2.243   0.0271 *  
## Distance               22.67      22.63   1.002   0.3189    
## SquareFeet            155.32      20.01   7.761 7.77e-12 ***
## Distance:SquareFeet   -30.76      16.67  -1.846   0.0679 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.62 on 99 degrees of freedom
## Multiple R-squared:  0.6836, Adjusted R-squared:  0.6708 
## F-statistic: 53.48 on 4 and 99 DF,  p-value: < 2.2e-16

compare the two models

anova(model1, model2)
## Analysis of Variance Table
## 
## Model 1: Price2014 ~ Distance + SquareFeet
## Model 2: Price2014 ~ NumFullBaths + Distance * SquareFeet
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    101 433959                              
## 2     99 400711  2     33248 4.1072 0.01934 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Question 1

Why is a nested F-test an appropriate way to compare these two models?

all of the predictor variables in model1 (distance from the bike path and size in square feet) are contained in the set of predictor variables in model2 (distance, size, number of full bathrooms, and the distance-by-size interaction). because model1 is nested within model2, a nested F-test lets us determine whether the additional predictor and interaction produce a significantly better model.

Question 2

Which model is the “full” model, and which is the “reduced” model? How can you tell?

the full model is model2, because it contains all of model1’s predictors plus additional terms (model1 is nested within it). model1 does not contain all of model2’s predictors, so it is the reduced model.

Question 3

What null hypothesis does this nested F-test assess?

the nested F-test’s hypotheses are below:

\(H_0: \beta_i = 0\) for all \(\beta_i \in W\)
\(H_a: \beta_i \neq 0\) for at least one \(\beta_i \in W\)

where \(W\) is the set of coefficients that appear in the full model but not the reduced model (here, the NumFullBaths and Distance:SquareFeet terms).

Question 4

Based on the results of this nested F-test, which model of home prices should be preferred: Model 1 or Model 2? Why?

the ANOVA table tells us that \(SSE_{reduced} - SSE_{full} = 33248\), and that \(k_{full} - k_{reduced} = 2\). The F-statistic is \(4.1072\), with a significant p-value of \(0.01934\), which tells us that adding the number of full bathrooms and allowing the interaction between distance and square footage significantly improved the model’s fit, so Model 2 should be preferred.
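
As a sanity check, the F-statistic can be recomputed by hand from the RSS values printed in the ANOVA table (a sketch using those printed values; pf() gives the upper-tail p-value):

```r
## residual sums of squares and degrees of freedom from the anova() output
rss_reduced <- 433959
rss_full    <- 400711
df_extra    <- 2    # parameters added in the full model
df_full     <- 99   # residual df of the full model

f_stat <- ((rss_reduced - rss_full) / df_extra) / (rss_full / df_full)
p_val  <- pf(f_stat, df_extra, df_full, lower.tail = FALSE)
f_stat  # ~4.107, matching the table
p_val   # ~0.0193
```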

Question 5

Based on your model comparison above, can you tell whether adding the \(Distance \cdot SquareFeet\) interaction is responsible for the increased goodness of fit in the full model, or whether the increase is due to adding NumFullBaths to the model?

no – we cannot determine whether the improvement comes from including another predictor or from allowing the predictors to interact, because both changes were made at once. the nested F-test only tells us how much model2 improved on model1 overall; since two changes were made simultaneously, we cannot attribute the improvement to either change individually with what we have so far.
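
One way to disentangle the two changes would be to add one term at a time and run a separate nested F-test for each (a sketch, assuming the same RailsTrails data and model1 from above; bath_only and inter_only are illustrative names):

```r
## add only the extra predictor
bath_only  <- lm(Price2014 ~ Distance + SquareFeet + NumFullBaths,
                 data = RailsTrails)
## add only the interaction
inter_only <- lm(Price2014 ~ Distance * SquareFeet, data = RailsTrails)

anova(model1, bath_only)   # tests NumFullBaths on its own
anova(model1, inter_only)  # tests the interaction on its own
```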

Class 12 | March 6, 2024

load libraries

library(tidyverse)
library(ggplot2)

load data

fishing <- read_csv("https://wjhopper.github.io/SDS-201/data/fishing.csv",
                    name_repair = "universal") |>
  select(Year, Lake, Walleye, Channel.Catfish)

QUESTION 1

generate model

lake_model = lm(Walleye ~ Lake, data = fishing)
summary(lake_model)
## 
## Call:
## lm(formula = Walleye ~ Lake, data = fishing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3351.5  -165.9   -37.8    73.3 12053.5 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3351.5      130.3  25.713   <2e-16 ***
## LakeHuron        -1786.3      179.0  -9.982   <2e-16 ***
## LakeMichigan     -3175.5      184.3 -17.227   <2e-16 ***
## LakeOntario      -3313.7      188.9 -17.540   <2e-16 ***
## LakeSaint Clair  -3186.7      199.8 -15.947   <2e-16 ***
## LakeSuperior     -3268.1      179.0 -18.261   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1492 on 768 degrees of freedom
##   (97 observations deleted due to missingness)
## Multiple R-squared:  0.4043, Adjusted R-squared:  0.4004 
## F-statistic: 104.2 on 5 and 768 DF,  p-value: < 2.2e-16

visualize

# plot(emmeans(lake_model, specs = ~ Lake),
#      horizontal = FALSE,
#      ylab = "Amount of Walleye caught (1000s of lbs)")

how different is the amount of Walleye caught on average between Lake Ontario and Lake Michigan?

# avg_walleye <- emmeans(lake_model, specs = ~ Lake)
# contrast(avg_walleye, method = "consec")

QUESTION 2

Fit a regression model that predicts the amount of Walleye caught based on an interaction between the lake being fished, and the amount of Channel Catfish caught

walleye_model <- lm(Walleye ~ Channel.Catfish * Lake,
                    data = filter(fishing, Lake %in% c("Erie", "Huron")))
summary(walleye_model)
## 
## Call:
## lm(formula = Walleye ~ Channel.Catfish * Lake, data = filter(fishing, 
##     Lake %in% c("Erie", "Huron")))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4754.9 -1292.3  -215.3   998.1  9524.3 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3232.6884   361.7729   8.936 3.07e-16 ***
## Channel.Catfish               1.4802     0.4623   3.202 0.001595 ** 
## LakeHuron                 -1176.9270   566.3012  -2.078 0.038994 *  
## Channel.Catfish:LakeHuron    -2.6783     0.7939  -3.374 0.000895 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2430 on 195 degrees of freedom
##   (99 observations deleted due to missingness)
## Multiple R-squared:  0.2791, Adjusted R-squared:  0.268 
## F-statistic: 25.17 on 3 and 195 DF,  p-value: 8.274e-14

visualize new model

ggplot(data = filter(fishing, Lake %in% c("Erie", "Huron")),
       mapping = aes(x = Channel.Catfish, 
                     y = Walleye, 
                     color = Lake)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ x) +
  labs(x="Amount of Channel Catfish caught (1000s of lbs)",
       y="Amount of Walleye caught (1000s of lbs)")
## Warning: Removed 99 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 99 rows containing missing values or values outside the scale range
## (`geom_point()`).

What is the difference in the average amount of Walleye caught in Lake Erie and the average amount of Walleye caught in Lake Huron when 200 thousand pounds of Channel Catfish are caught?

generate grid of values

# walleye_grid <- emmeans(walleye_model, specs = ~ Channel.Catfish * Lake,
#                         at = list(Channel.Catfish = 200))
# walleye_grid

use the contrast

# contrast(walleye_grid, method = "pairwise", by="Channel.Catfish")

Class 11 | March 1, 2024

SET UP

load libraries

library(Stat2Data)
library(regplanely)
library(ggplot2)
library(performance)

load data

data("Diamonds")
head(Diamonds)
##   Carat Color Clarity Depth PricePerCt TotalPrice
## 1  1.08     E     VS1  68.6     6693.3     7228.8
## 2  0.31     F    VVS1  61.9     3159.0      979.3
## 3  0.31     H     VS1  62.1     1755.0      544.1
## 4  0.32     F    VVS1  60.8     3159.0     1010.9
## 5  0.33     D      IF  60.8     4758.8     1570.4
## 6  0.33     G    VVS1  61.5     2895.8      955.6

EXERCISE 1

visualize data

ggplot(data = Diamonds, mapping = aes(y = TotalPrice,
                                      x = Carat)) +
  geom_point() +
  # geom_smooth(method = lm, formula = y ~ x) +
  theme_bw() 

generate model

model = lm(TotalPrice ~ Carat * PricePerCt, data = Diamonds)
summary(model)
## 
## Call:
## lm(formula = TotalPrice ~ Carat * PricePerCt, data = Diamonds)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.112023 -0.025374  0.002803  0.026619  0.080538 
## 
## Coefficients:
##                    Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)      -1.096e-02  8.836e-03 -1.240e+00   0.2158    
## Carat             6.831e-03  1.049e-02  6.510e-01   0.5153    
## PricePerCt        2.936e-06  1.629e-06  1.802e+00   0.0724 .  
## Carat:PricePerCt  1.000e+00  8.663e-07  1.154e+06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03583 on 347 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 5.501e+12 on 3 and 347 DF,  p-value: < 2.2e-16

assess fitness of model

check_model(model, check = c("qq", "normality", "linearity", "homogeneity"))

while there are some mild violations of the assumptions made for the model, especially as fitted values increase in magnitude, the model is mostly sound.

visualize model

regression_plane(model)

interpret model

this model has an adjusted \(R^2\) value of 1, which means essentially all of the variation in TotalPrice is explained by our model. that is expected here: TotalPrice is by definition (approximately) Carat × PricePerCt, which is exactly the interaction term, and accordingly the Carat:PricePerCt coefficient is estimated at 1.000 while the other terms are near zero. the Carat coefficient alone (6.831e-03) cannot be interpreted in isolation, because the effect of Carat depends on PricePerCt through the interaction.
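
Since total price should just be the price per carat times the number of carats, the near-perfect fit can be checked directly (a sketch; the differences should be tiny relative to the prices themselves):

```r
## TotalPrice minus Carat * PricePerCt should be approximately zero
summary(with(Diamonds, TotalPrice - Carat * PricePerCt))
```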

Class 10 | February 28, 2024

SET UP

load libraries

library(tidyverse)
library(ggplot2)
library(dplyr)

load data

lung_volume = read.csv("https://bit.ly/lung_volume")
head(lung_volume)
##   age   fev   ht sex smoke
## 1   9 1.708 57.0   F    No
## 2   8 1.724 67.5   F    No
## 3   7 1.720 54.5   F    No
## 4   9 1.558 53.0   M    No
## 5   9 1.895 57.0   M    No
## 6   8 2.336 61.0   F    No

visualize data

ggplot(data = lung_volume, aes(x = smoke, y = fev)) +
  geom_boxplot() + 
  theme_bw()

QUESTION 1

Visualize a regression model that estimates how smoking affects FEV, taking into account a person’s age. Your model should allow the effect of smoking to depend on a person’s age.

ggplot(data = lung_volume, mapping = aes(y = fev, 
                                         color = smoke,
                                         x = age)) + 
  geom_point() +
  geom_smooth(method = lm, formula = y ~ x, se = F) + 
  theme_bw()

QUESTION 2

How does age affect lung volume among non-smokers? How does age affect lung volume among smokers?

smoking_model = lm(fev ~ age * smoke, data = lung_volume)
summary(smoking_model)$coefficients
##                Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)   0.2533955 0.08265075  3.065859  2.260474e-03
## age           0.2425584 0.00833154 29.113274 6.504859e-120
## smokeYes      1.9435707 0.41428463  4.691390  3.309624e-06
## age:smokeYes -0.1627027 0.03073753 -5.293290  1.645078e-07
  • age:smokeYes is the difference in slope for smokers, indicating that for each one-year increase in age, fev for smokers increases by -0.163 + 0.243 = 0.080.
  • age is the slope for the baseline group, aka non-smokers, indicating that for each one-year increase in age, fev increases by 0.243.
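
The two slopes can be recovered from the fitted coefficients directly (a sketch using the smoking_model fit above):

```r
coefs <- coef(smoking_model)
nonsmoker_slope <- coefs["age"]                          # ~0.243
smoker_slope    <- coefs["age"] + coefs["age:smokeYes"]  # ~0.243 - 0.163 = 0.080
nonsmoker_slope
smoker_slope
```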

QUESTION 3

Do you believe that smoking changes the relationship between age and lung volume? Support your answer using information from the regression table

yes – this model indicates that for people who smoke, lung volume increases with age much more slowly (slope of 0.243 - 0.163 = 0.080) than it does for non-smokers (slope of 0.243). the regression table indicates that this difference in slope between non-smokers and smokers is significant (p-value for age:smokeYes = 1.645e-07).

QUESTION 4

Is there a difference in lung volume between an average-aged person who smokes and an average-aged person who doesn’t smoke?

generate recentered model

the existing model’s smokeYes coefficient compares smokers and non-smokers at age 0, but we can mean-center the X value to compare lung volumes when \(age - \bar{x}_{age} = 0\), which is when age = average age.

ggplot(data = lung_volume, aes(x = age - mean(age), 
                               y = fev,
                               color = smoke)) + 
  geom_point() + 
  geom_smooth(se = F, method = lm, formula = y ~ x) +
  theme_bw()

generate a new model

lung_volume = mutate(lung_volume,
                     mc_age = age - mean(age))

mc_lung_model = lm(fev ~ mc_age * smoke, data = lung_volume)
summary(mc_lung_model)$coefficients
##                   Estimate Std. Error    t value      Pr(>|t|)
## (Intercept)      2.6622898 0.02305217 115.489795  0.000000e+00
## mc_age           0.2425584 0.00833154  29.113274 6.504859e-120
## smokeYes         0.3277391 0.12861470   2.548225  1.105618e-02
## mc_age:smokeYes -0.1627027 0.03073753  -5.293290  1.645078e-07

this new model tells us that the predicted fev for an average-aged smoker is 0.328 higher than for an average-aged non-smoker (after mean-centering, the smokeYes coefficient is the gap between the two groups at the average age, not a change in slope). we can tell that difference is significant because the p-value for that coefficient is 0.011 < 0.05.
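
As a consistency check (a sketch using the original smoking_model), the same ~0.328 gap can be obtained by predicting fev for a smoker and a non-smoker at the mean age:

```r
mean_age <- mean(lung_volume$age)
preds <- predict(smoking_model,
                 newdata = data.frame(age = mean_age, smoke = c("No", "Yes")))
unname(preds[2] - preds[1])  # ~0.328, matching the smokeYes coefficient
```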

Class 9 | February 23, 2024

SET UP

load libraries

library(tidyverse)
library(moderndive)
library(performance)

data wrangling

nyc_airport_weather = read.csv("https://bit.ly/nyc_airport_weather")
nyc_airport_weather = nyc_airport_weather %>%
  filter(!is.na(avg_temp))

head(nyc_airport_weather)
##   month airport avg_temp avg_humid
## 1     1     EWR 35.56216  62.12451
## 2     1     JFK 35.38555  61.66162
## 3     1     LGA 35.95927  59.15609
## 4     2     EWR 34.26332  63.33558
## 5     2     JFK 34.19246  62.55848
## 6     2     LGA 34.35612  60.68603

QUESTION 1

Visualize a parallel slopes model that uses temperature and airport to predict the humidity level

visualize new data

ggplot(data = nyc_airport_weather, mapping = aes(y = avg_humid,
                                                 x = avg_temp,
                                                 color = airport)) + 
  geom_point() + 
  geom_parallel_slopes(se = F) + 
  theme_bw()

Fit the same parallel slopes model you visualized and report the regression table

fit + report

airport_model = lm(avg_humid ~ avg_temp + airport, data = nyc_airport_weather)
summary(airport_model)$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 55.9468963 2.74595125 20.374322 1.648654e-19
## avg_temp     0.1280757 0.04549216  2.815337 8.394683e-03
## airportJFK   2.2716649 1.74601374  1.301058 2.028265e-01
## airportLGA  -3.7458961 1.74800212 -2.142959 4.007585e-02

assess the fit of the model

check_model(airport_model, check = c("qq", "normality", "linearity", "homogeneity"))

from this check, we see that there are some mild assumption violations in the normality of residuals tests, but otherwise, this model is mostly appropriate.

QUESTION 2

Match each of the labeled line segments in the figure below to the coefficient in the equation that it best represents, and also note the coefficient’s fitted value from the regression table.

  • A –> \(b_2\) –> 2.272
  • B –> \(b_3\) –> -3.746
  • C –> \(b_1\) –> 0.128
  • D –> none –> none

QUESTION 3

What is the estimated humidity at Newark Liberty International Airport when the average monthly temperature is zero degrees?

new_data = data.frame(avg_temp = 0, airport = "EWR")
predict(airport_model, new_data)
##       1 
## 55.9469
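
Because EWR is the baseline airport and avg_temp is zero, this prediction is just the model’s intercept (a quick sketch):

```r
## with avg_temp = 0 and the baseline airport, all other terms drop out
coef(airport_model)["(Intercept)"]  # 55.9469
```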

QUESTION 4

Imagine that a flight that was scheduled to land at Newark Liberty International Airport got diverted to LaGuardia airport, and the temperature was 64 degrees. Would it be more humid or less humid at LaGuardia than at Newark when the flight lands? By how much?

humidity at EWR

predict(airport_model, data.frame(avg_temp = 64, airport = "EWR"))
##        1 
## 64.14374

humidity at LGA

predict(airport_model, data.frame(avg_temp = 64, airport = "LGA"))
##        1 
## 60.39785

based on this model, we can predict that it would be less humid at LGA when the plane lands than it would be at EWR, given a temperature of 64 degrees at both – about 3.75 percentage points less humid (64.14 − 60.40 ≈ 3.75).
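
In a parallel slopes model the gap between the two airports is the same at every temperature, so the difference in predictions should match the airportLGA coefficient (a sketch):

```r
ewr <- predict(airport_model, data.frame(avg_temp = 64, airport = "EWR"))
lga <- predict(airport_model, data.frame(avg_temp = 64, airport = "LGA"))
unname(lga - ewr)                  # ~ -3.746
coef(airport_model)["airportLGA"]  # same value
```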

Class 8 | February 21, 2024

SET UP

load libraries

library(Stat2Data)
library(ggplot2)

data("BaseballTimes2017")

head(BaseballTimes2017)
##      Game League Runs Margin Pitchers Attendance Time
## 1 CHC-ARI     NL   11      5       10      39131  203
## 2 KCR-CHW     AL    9      3        7      18137  169
## 3 MIN-DET     AL   13      5       10      29733  201
## 4 SDP-LAD     NL    7      1        6      52898  179
## 5 COL-MIA     NL    9      3       10      20096  204
## 6 CIN-MIL     NL   21      1       10      34517  235

generate model

pitchers_model = lm(Time ~ Pitchers, data = BaseballTimes2017)
summary(pitchers_model)
## 
## Call:
## lm(formula = Time ~ Pitchers, data = BaseballTimes2017)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20.21 -11.19  -5.25   5.06  31.79 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  124.069     23.412   5.299 0.000189 ***
## Pitchers       8.017      2.722   2.946 0.012239 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.52 on 12 degrees of freedom
## Multiple R-squared:  0.4197, Adjusted R-squared:  0.3713 
## F-statistic: 8.678 on 1 and 12 DF,  p-value: 0.01224

visualize model

ggplot(data = BaseballTimes2017, mapping = aes(x = Pitchers, 
                                               y = Time)) + 
  geom_point() +
  geom_smooth(method = lm, formula = y ~ x) + 
  theme_bw()

QUESTION 1

Extend the simple linear regression model above by including the total number of runs scored as an explanatory variable

pitcher_run_model = lm(Time ~ Pitchers + Runs, data = BaseballTimes2017)
summary(pitcher_run_model)
## 
## Call:
## lm(formula = Time ~ Pitchers + Runs, data = BaseballTimes2017)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.789 -10.682  -2.683   5.260  34.211 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  134.135     21.410   6.265 6.13e-05 ***
## Pitchers       2.787      3.528   0.790   0.4462    
## Runs           3.262      1.600   2.039   0.0663 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.59 on 11 degrees of freedom
## Multiple R-squared:  0.5788, Adjusted R-squared:  0.5022 
## F-statistic: 7.558 on 2 and 11 DF,  p-value: 0.008605

Question 1A

How should you interpret the Runs coefficient in the context of these data?

The Runs coefficient tells us that, based on this sample, the model predicts that for each increase in Runs by one, holding the Pitchers value constant at any given value, the game time will increase by 3.262 minutes.

Question 1B:

Is there still good evidence for a linear relationship between the number of pitchers and game duration, even after taking into account the number of runs scored in a game?

No. Once the number of runs is accounted for, the Pitchers coefficient is no longer significant (p = 0.4462), while Runs shows a stronger association with game duration (p = 0.0663). So there is not good evidence for an independent linear relationship between the number of pitchers and game duration after controlling for Runs.

QUESTION 2

Why does the effect of Pitchers appear smaller, once you account for the number of runs scored?

This is a multiple linear regression model, so the coefficient for Pitchers is adjusted for Runs: it reflects the association between Pitchers and Time while holding Runs constant. Because Pitchers and Runs are themselves related, part of the apparent effect of Pitchers in the simple model was actually carried by Runs, which makes the Pitchers effect look smaller once Runs is accounted for.
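
One way to see the overlap (a sketch using the same BaseballTimes2017 data) is to check the correlation between the two predictors directly:

```r
## a strong positive correlation here means the two predictors share
## explanatory information about game duration
cor(BaseballTimes2017$Pitchers, BaseballTimes2017$Runs)
```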

Class 5 | February 9, 2024

SET UP

load libraries and data

library(Stat2Data)
library(tidyverse)
data("MothEggs")

generate regression model

moth_model = lm(Eggs ~ BodyMass, data = MothEggs)
summary(moth_model)
## 
## Call:
## lm(formula = Eggs ~ BodyMass, data = MothEggs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -157.586  -17.187    3.162   25.790   67.960 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    24.38      45.38   0.537  0.59423   
## BodyMass       79.86      26.69   2.992  0.00492 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.75 on 37 degrees of freedom
## Multiple R-squared:  0.1948, Adjusted R-squared:  0.173 
## F-statistic:  8.95 on 1 and 37 DF,  p-value: 0.004916

visualize distribution

ggplot(data = MothEggs, mapping = aes(x = BodyMass, y = Eggs)) +
  geom_point() + 
  geom_smooth(method = lm, formula = y ~ x) + 
  scale_x_continuous(breaks = seq(1.2, 2.3, by = 0.1)) + 
  theme_bw()

EXERCISE 1

slope_coeff = 79.86

## standard error of the slope coefficient, taken from the regression table
## (not the sd of BodyMass, which is a different quantity)
std_error = 26.69

alpha = 0.05
df = nrow(MothEggs) - 2
t_crit = qt(1 - alpha/2, df)

margin_error = t_crit * std_error

lower = slope_coeff - margin_error
upper = slope_coeff + margin_error
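
The hand-built interval can be checked against confint(), which computes the same t-based interval directly from the fitted model (a sketch):

```r
## should agree with the lower and upper bounds computed above
confint(moth_model, "BodyMass", level = 0.95)
```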

EXERCISE 2

for a moth with a body mass of 1.3 grams, the possible range of eggs is about 100 to 150.

EXERCISE 3

if an 84% confidence interval was constructed around the fitted slope coefficient from each sample, i would predict that 0.84 × 200 = 168 of the intervals would capture the true slope of the Eggs ~ BodyMass regression model.

Class 4 | February 7, 2024

SET UP

library(tidyverse)

generate data

set.seed(1)
n = 20

# simulated_data = data.frame(dataset_id = rep(1:10000, each = n),
#                              x = rnorm(n, mean = 10, sd = 5)) %>%
#   mutate(y = 3 + rnorm(n*10000, mean = 0, sd = 2))

constructing the sampling distribution of the standardized slope

# extract_slope_and_std_error = function(model) {
#   reg_table = summary(model)$coefficients 
#   stats = data.frame(Slope = reg_table[2, "Estimate"],
#                      Std_Error = reg_table[2, "Std. Error"])
#   return(stats)
# }
# 
# ## generate slopes from simulated data grouped by id number, then summarize w/ slope and std error
# slopes = simulated_data %>%
#   group_by(dataset_id) %>%
#   summarise(extract_slope_and_std_error(lm(y ~ x)))
# 
# head(slopes)

EXERCISE 1

## t-value is the slope/std-error value
# slopes = slopes %>%
#   mutate(t = Slope / Std_Error)
# 
# head(slopes)

EXERCISE 2

visualize the t-statistic distribution

# ## data is slopes, put on the x-axis the t-statistics
# ggplot(slopes, aes(x = t)) + 
#   ## make a histogram with 40 bins that shows density
#   geom_histogram(mapping = aes(y = after_stat(density)),
#                  bins = 40) + 
#   ## do colors
#   geom_function(color = "blue", fun = function(x) { dt(x, df =10000-2)}) +
#   ## set the bounds for the x axis
#   scale_x_continuous("t statistics for slope", limits = c(-5, 5))

EXERCISE 3

the qt() function returns quantiles of a t distribution. the t distribution is symmetric about zero, which is why these two quantiles are the + and − versions of the same value

## has only 1% of results below it (99% above)
qt(0.01, df = 10000-2)
## [1] -2.326721
## has 99% of all results below it (only 1% above)
qt(0.99, df = 10000-2)
## [1] 2.326721
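
The symmetry claim can be verified numerically (a quick sketch):

```r
## the t distribution is symmetric about zero, so the 1st and 99th
## percentiles are negatives of one another
all.equal(qt(0.01, df = 10000 - 2), -qt(0.99, df = 10000 - 2))  # TRUE
```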

EXERCISE 4

violating the normality assumption

we do this by using an exponential distribution instead of a normal one

set.seed(1)

# bad_data = data.frame(dataset_id = rep(1:10000, each = n),
#                       x = rnorm(n, mean = 10, sd = 5)) %>%
#   mutate(y = 3 + rexp(n*10000, rate = 3))
# 
# head(bad_data)

generate slopes

# bad_model_slopes = bad_data %>%
#   group_by(dataset_id) %>%
#   summarize(extract_slope_and_std_error(lm(y ~ x))) %>%
#   mutate(t = Slope / Std_Error)

visualize the bad_data

slopes of exponential distribution

# ggplot(data = bad_model_slopes, mapping = aes(x = t)) + 
#   geom_histogram(aes(y = after_stat(density)), 
#                  bins = 40, fill = "red", alpha = 0.5) +
#   geom_function(color = "blue", fun = function(x) dt(x, df = n - 2)) +
#   scale_x_continuous("t statistics for slope", limits = c(-5, 5))

I would have expected this distribution not to be unimodal and mostly symmetrical – rather, because the errors are exponential, I would have expected more of the mass to lie away from the center, producing more of a bowl shape than a hill.

EXERCISE 5

1st percentile

this value has 99% of all values in the distribution above it

# bad_model_slopes %>%
#   summarize(q = quantile(t, 0.01))

99th percentile

this value has only 1% of values in the rest of the distribution above it.

this value is closer to zero than the 0.01 quantile, implying the distribution isn’t symmetrical

# bad_model_slopes %>%
#   summarize(q = quantile(t, 0.99))

EXERCISE 6

the t-distribution should be fully symmetrical, but this one is not, meaning you can’t make accurate two-tailed test inferences with it.

Class 3 | February 2, 2024

SET UP

load data and libraries

library(infer)
library(ggplot2)

NCbirths = read.csv("https://bit.ly/ncbirths")
head(NCbirths)
##   fage mage      mature weeks      term visits gained weight    sex smoker
## 1   22   20 younger mom    32    premie      5     40   2.69   male      1
## 2   32   32 younger mom    38 full term     10     42   8.88   male      0
## 3   34   33 younger mom    38 full term     18      6   7.06 female      0
## 4   NA   18 younger mom    42 full term     15     27   7.44   male      0
## 5   NA   22 younger mom    40 full term     18     54   6.38   male      1
## 6   43   39  mature mom    38 full term     17     40   7.50 female      0

visualize data

we are looking at the relationship between the age of the mother and the weight of the baby

plot = ggplot(data = NCbirths, mapping = aes(x = mage, y = weight)) +
  geom_point() + 
  geom_smooth(method = lm, formula = y ~ x, se = FALSE) +
  theme_bw()

plot

use randomization distribution to test

the observed slope fitted to the actual data

observed_slope = NCbirths %>%
  specify(weight ~ mage) %>%
  calculate(stat = "slope")

generate the random distribution using permutations

set.seed(54321)

permutation_dist = NCbirths %>%
  specify(weight ~ mage) %>%                    ## variables and relationship?
  hypothesize(null = "independence") %>%        ## set the null hypothesis there's no relationship
  generate(reps = 1000, type = "permute") %>%   ## generate the random distributions
  calculate(stat = "slope")                     ## calculate the slope of each distribution

visualize the random distribution

visualize(permutation_dist) +
  shade_p_value(obs_stat = observed_slope, direction = "both")
## Warning in (function (mapping = NULL, data = NULL, stat = "identity", position = "identity", : All aesthetics have length 1, but the data has 1000 rows.
## ℹ Did you mean to use `annotate()`?

generate a p-value

p_value = permutation_dist %>%
  get_p_value(obs_stat = observed_slope, direction = "both")

p_value
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.072

EXERCISE 1

the slope of the line in figure 1 (the first visualization) is the slope fitted to the original data, aka observed_slope:

## Response: weight (numeric)
## Explanatory: mage (numeric)
## # A tibble: 1 × 1
##     stat
##    <dbl>
## 1 0.0134

EXERCISE 2

Using the visualize() function displays a histogram. Ignoring any red shaded regions, what does this histogram itself represent?

this visualization shows the distribution of the slope values for the 1000 random permutations of the dataset. so it shows the number of permutations (y-axis) that produced a certain slope value (x-axis).

EXERCISE 3

What are the chances of observing a slope as extreme or more extreme as the one observed in these data, if the mother’s age and the baby’s birth weight are truly unrelated to one another?

given the mother’s age has no impact on the weight of the baby, the chances of observing a slope as extreme as the one observed in the data is the p-value of the original data in the randomization distribution, aka p_value

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.072

EXERCISE 4

according to this model, the p-value (0.072) is greater than 0.05, so we do not have significant evidence that the mother’s age is correlated with the baby’s weight.

Class 2 | January 31, 2024

SET UP

load libraries

library(Stat2Data)
library(ggplot2)
library(broom)
library(equatiomatic)

load data

## load data
data("BaseballTimes2017")
head(BaseballTimes2017)
##      Game League Runs Margin Pitchers Attendance Time
## 1 CHC-ARI     NL   11      5       10      39131  203
## 2 KCR-CHW     AL    9      3        7      18137  169
## 3 MIN-DET     AL   13      5       10      29733  201
## 4 SDP-LAD     NL    7      1        6      52898  179
## 5 COL-MIA     NL    9      3       10      20096  204
## 6 CIN-MIL     NL   21      1       10      34517  235

generate visualization

plot = ggplot(BaseballTimes2017, mapping = aes(x = Pitchers, 
                                               y = Time)) + 
  geom_point() +
  theme_bw() + 
  geom_smooth(method = lm, se = FALSE, formula = y ~ x)

plot

EXERCISE 1

a: Fit the regression model visualized above

baseball_model = lm(Time ~ Pitchers, BaseballTimes2017)

b: Print the model summary

summary(baseball_model)
## 
## Call:
## lm(formula = Time ~ Pitchers, data = BaseballTimes2017)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20.21 -11.19  -5.25   5.06  31.79 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  124.069     23.412   5.299 0.000189 ***
## Pitchers       8.017      2.722   2.946 0.012239 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.52 on 12 degrees of freedom
## Multiple R-squared:  0.4197, Adjusted R-squared:  0.3713 
## F-statistic: 8.678 on 1 and 12 DF,  p-value: 0.01224

c: Interpret the model’s slope coefficient in a sentence.

This model indicates that for each additional pitcher used during a game, the duration of the game tends to be 8.017 minutes longer.

EXERCISE 2

Use the fitted model’s equation to compute the residual error for Game #10, between the Los Angeles Angels and the Seattle Mariners, a game where 9 pitchers were used (feel free to use pen and paper when working with the equation). Finally, check your work using the augment() function from the broom package to inspect the precise residual error.

estimate on your own

isolate the coefficients

baseball_coefs = coef(baseball_model)
baseball_coefs
## (Intercept)    Pitchers 
##  124.068966    8.017241

generate the estimated time

## estimated Game 10 time = slope * # of pitchers + intercept
g10_time_est = 8.017241*9 + 124.068966
g10_time_est
## [1] 196.2241

calculate the residual

actual time:

## access baseball times at row 10, column 7 (game 10, time)
g10_time = BaseballTimes2017[10, 7]
g10_time
## [1] 184

calculate the residual by subtracting the estimate from the actual value:

g10_resid = g10_time - g10_time_est
g10_resid
## [1] -12.22413

check your work

calculate the residuals for all rows from the computer’s model

residuals = augment(baseball_model)

isolate game 10’s actual residual

## # A tibble: 1 × 1
##   .resid
##    <dbl>
## 1  -12.2